Assessing the Performance of Chat Generative Pretrained Transformer (ChatGPT) in Answering Andrology-Related Questions

Objective: The internet and social media have become primary sources of health information, with men frequently turning to these platforms before seeking professional help. Chat generative pretrained transformer (ChatGPT), an artificial intelligence model developed by OpenAI, has gained popularity as a natural language processing program. The present study evaluated the accuracy and reproducibility of ChatGPT's responses to andrology-related questions. Methods: The study analyzed frequently asked andrology questions from health forums, hospital websites, and social media platforms like YouTube and Instagram. Questions were categorized into topics like male hypogonadism, erectile dysfunction, etc. The European Association of Urology (EAU) guideline recommendations were also included. These questions were input into ChatGPT, and responses were evaluated by 3 experienced urologists who scored them on a scale of 1 to 4. Results: Out of 136 evaluated questions, 108 met the criteria. Of these, 87.9% received correct and adequate answers, 9.3% were correct but insufficient, and 3 responses contained both correct and incorrect information. No question was answered completely wrong. The highest correct answer rates were for disorders of ejaculation, penile curvature, and male hypogonadism. The EAU guideline-based questions achieved a correctness rate of 86.3%. The reproducibility of the answers was over 90%. Conclusion: The study found that ChatGPT provided accurate and reliable answers to over 80% of andrology-related questions. While limitations exist, such as potential outdated data and inability to understand emotional aspects, ChatGPT's potential in the health-care sector is promising. Collaborating with health-care professionals during artificial intelligence model development could enhance its reliability.


Introduction
Men may be reluctant to discuss their health problems, concerns, and fears with health professionals. This hesitation is especially evident in issues related to men's health. 1 The internet and social media are frequently used by patients as sources of health information. Men especially turn to the internet for health information before consulting professionals. 2 Social media (YouTube, Facebook, Instagram, Twitter, etc.) and artificial intelligence (AI) applications that have become popular in recent years are the first sources that come to mind in this regard.
Chat generative pretrained transformer (ChatGPT), a natural language processing program developed by OpenAI, is one of the most frequently used AI programs. 3 The increasing popularity of ChatGPT has paved the way for research on the effectiveness of its application in the field of health. Bonetti et al 4 showed that ChatGPT achieved a high accuracy rate of 87% in the Italian Residency Admission National Exam. In another study, Deiana et al 5 demonstrated that ChatGPT provided a high rate of correct answers to questions related to public health. 5 Although some studies have shown the effectiveness of ChatGPT on different medical topics, its adequacy for questions related to andrology and men's health has not been previously evaluated. In this study, we aimed to evaluate the adequacy of ChatGPT responses to questions related to andrology.

Material and Methods
Questions frequently asked by patients about andrology on health forums, hospital websites, and social media (YouTube and Instagram) were analyzed. In selecting the sources for determining the questions, we aimed for a diverse and representative sample of platforms where patients typically seek information related to andrology. Only questions in English were included in the study. Questions were categorized by topic (male hypogonadism, erectile dysfunction, disorders of ejaculation, low sexual desire and male hypoactive sexual desire disorder, penile curvature, penile size abnormalities and dysmorphophobia, priapism, and male infertility). In addition, the recommendation tables of the Sexual and Reproductive Health section of the 2023 European Association of Urology (EAU) guidelines were analyzed. 6 Those with a strong recommendation level were converted into question form. All questions were submitted in English to the August version of ChatGPT 3.5. The responses generated by the AI were recorded. All questions were asked again twice at different times during the day to evaluate the reproducibility of the answers.
Responses were evaluated by 3 urologists experienced in andrology. The reviewers scored the responses on a scale of 1 to 4, comparing each response with how they would have answered had a patient asked them the same question.

4: Correct and adequate answer (no further information to add)
3: Correct answer but insufficient (more detailed explanation required)
2: Accurate and misleading information together
1: Wrong or irrelevant answer

The median score was recorded for questions in which not all reviewers gave the same score. Reproducibility was defined as the consistency of the answers given to the same question at different times. Responses generated at different times were considered reproducible if they received the same score. The median score was noted for questions that received different answers when repeated. Questions whose answers vary from person to person were not included in the study (e.g., I am 35 years old; can I have children?). Other exclusion criteria were questions with similar meanings, questions that did not conform to language rules, and non-medical questions. Since no patient data were used in the study, ethics committee approval was not required.
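The scoring rules described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names and the example scores are hypothetical, they are not taken from the study data, and the study does not report using code for this step.

```python
from statistics import median

def consensus_score(reviewer_scores):
    """Return the median of the reviewers' 1-4 scores for one response."""
    return median(reviewer_scores)

def is_reproducible(run_scores):
    """Return True when every repetition of a question received the same score."""
    return len(set(run_scores)) == 1

def final_score(run_scores):
    """Return the median score across the repetitions of one question."""
    return median(run_scores)

# Hypothetical example: one question asked three times,
# each response scored independently by three reviewers.
runs = [[4, 4, 3], [3, 3, 4], [4, 4, 4]]
per_run = [consensus_score(scores) for scores in runs]

print(per_run)                   # consensus score of each repetition
print(is_reproducible(per_run))  # same score every time?
print(final_score(per_run))      # score recorded for the question
```

In this hypothetical case the second repetition receives a lower consensus score than the other two, so the question would be counted as not reproducible, and the median across repetitions would be recorded as its score.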

Statistical analysis
Excel version 16.0 (Microsoft Corp.; Washington, USA) was used for the statistical analyses. The scores of the responses were expressed as n (%). Reproducibility of responses was expressed as %.
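The same descriptive summary can be reproduced outside a spreadsheet. The following is a minimal Python sketch, not the authors' Excel workflow; the helper names and the toy data are assumptions introduced purely for illustration and do not correspond to the study results.

```python
from collections import Counter

def summarize_scores(scores):
    """Return {grade: (n, percent)} for grades 4 to 1, percent rounded to 1 decimal."""
    counts = Counter(scores)
    total = len(scores)
    return {
        grade: (counts[grade], round(100 * counts[grade] / total, 1))
        for grade in (4, 3, 2, 1)
    }

def reproducibility_rate(flags):
    """Return the percentage of questions whose repeated answers scored the same."""
    return round(100 * sum(flags) / len(flags), 1)

# Toy data (hypothetical): final scores for 10 questions.
toy_scores = [4] * 8 + [3] + [2]
for grade, (n, pct) in summarize_scores(toy_scores).items():
    print(f"Score {grade}: {n} ({pct}%)")

# Toy data (hypothetical): 9 of 10 questions were reproducible.
toy_flags = [True] * 9 + [False]
print(f"Reproducibility: {reproducibility_rate(toy_flags)}%")
```

Reporting both the count and the percentage, as the n (%) format does, lets a reader check each percentage against its denominator.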

MAIN POINTS
• The present study showed that chat generative pretrained transformer gave highly accurate answers to questions frequently asked by patients about andrology.
• In addition, the AI model performed similarly satisfactorily in questions of evidence-based medicine.
• Artificial intelligence technology can take an important place in the health sector in the future.

Results
The flowchart for the questions included in the study is shown in Figure 1. Of the 136 questions evaluated, 28 did not meet the inclusion criteria. Answers to 108 questions were included in the study (Supplementary Table 1). Ninety-five (87.9%) of the questions were answered correctly and adequately. Ten answers (9.3%) were correct but inadequate, and 3 answers contained both correct and incorrect information. No question was answered incorrectly. The topics of the questions with the highest rate of correct answers were disorders of ejaculation (92.9%), penile curvature (92.9%), and male hypogonadism (91.7%). Erectile dysfunction, penile size abnormalities, and male infertility each had 1 answer containing both correct and incorrect information. Eighty questions were prepared according to EAU guideline recommendations (Supplementary Table 2). Of these questions, 86.3% were answered completely correctly. Eight questions (10%) received 3 points, and 3 questions (3.8%) received 2 points. All questions regarding ejaculation disorders were answered correctly. The lowest completely correct response rate was for questions about male hypogonadism. Similar to the frequently asked questions, there were no completely wrong answers in the guideline recommendations (Table 1).
The reproducibility rates of the answers to the questions are shown in Figure 2. The AI model produced similar answers for all questions related to erectile dysfunction, disorders of ejaculation, low sexual desire, penile curvature, penile size abnormalities, and male infertility. The similarity rates for the answers to questions about male hypogonadism and priapism were 91.7%. The similarity rate for the questions prepared according to the EAU guideline recommendations was 97.2%.

Discussion
Artificial intelligence models stand out with their increasing use in many areas of life. The utilization of AI within the realm of health-care has emerged as a notably significant topic in recent times. 7 This study evaluated the accuracy and reliability of ChatGPT in responding to questions related to andrology. Some research aimed at evaluating the feasibility of utilizing AI-driven platforms to address patients' questions and concerns has been conducted in various medical fields. Lee et al 8 revealed that ChatGPT provided satisfactory responses to patient questions about colonoscopy. Samaan et al 9 showed that the model gave 86.8% correct answers to questions related to bariatric surgery. Our study revealed the impressive performance of ChatGPT in questions related to andrology.
Social media has taken its place as one of the most important information sources for people in the field of health, as in many different fields. Zaila et al 10 evaluated the characteristics of YouTube videos related to men's health. They found that, in addition to quality videos, there are many videos with advertising purposes and biases. Dubin et al 11 found that only about 10% of TikTok and Instagram content related to men's health was uploaded by health professionals. They stated that the remaining videos contained a high rate of misinformation. 11 Our study showed that the answers generated by ChatGPT were correct at a rate of 87.9%. The ability of AI software to access the literature and its self-improving structure are important reasons for the high rate of correct answers. In addition, the fact that the information provided is prepared without the concern of advertisements minimizes the bias rate of the answers.
Our results showed that ChatGPT gave highly accurate answers to questions based on EAU guideline recommendations, as well as questions frequently asked by patients. Since the guidelines contain comprehensive and detailed medical information, the success of AI in this regard is remarkable. Ali et al 12 tested ChatGPT with neurosurgery written board examination questions and showed that the application obtained enough points to pass the exam. 12 Panthier et al 13 applied the ophthalmology board exam to ChatGPT in the French language and showed that AI was 91% successful. ChatGPT can provide accurate answers to challenging medical questions thanks to its access to many scientific articles and book contents. Its ability to generate answers in many commonly spoken languages also allows it to appeal to a wide population.
An online information source should be easily understandable and easily accessible, in addition to having a high accuracy rate. ChatGPT's answers to the questions are in colloquial language and easy to understand. 14 Understanding the information accessed from search engines can be tiring for patients. In addition, the reproducibility of the answers generated by AI is another important issue. Yeo et al 15 showed that ChatGPT's answers to frequently asked questions about cirrhosis and hepatocellular carcinoma were reproducible at around 90%. 15 Our results showed that ChatGPT's answers to questions related to andrology were highly reproducible.
Although ChatGPT has a high rate of correct answers to questions related to andrology, it has some inadequacies. The application only includes data from 2021 and before. Since the literature on andrology is constantly renewed, ChatGPT may be insufficient to access up-to-date information. In addition, we do not know the details of the database used by the application. The model does not have personal medical experience. Therefore, it may be difficult for ChatGPT to understand important aspects, such as patient experiences, emotional states, or subjective approaches. Terms, procedures, and details in the medical field are extremely complex. It may be difficult to understand and correctly use this specialized terminology and details. Finally, medical topics often involve personal and sensitive information. A model such as ChatGPT may have limitations in understanding ethical and confidentiality rules and producing sensitive responses.
Our study has some limitations. First, the questions selected for analysis may not fully represent the scope of the questions encountered in clinical practice. Categorization of questions according to specific andrological topics may cause bias in the selection process. The categories selected for analysis may not cover the full range of andrological concerns expressed by patients. While responses from ChatGPT were evaluated by experienced urologists, human judgment is inherently subjective. Although efforts were made to minimize this subjectivity by having more than 1 urologist evaluate the responses, differences in interpretation may still have affected the scoring process. Factors such as updates to the model or changes in training data could potentially affect response consistency. Additionally, the questions were evaluated only in English; responses of similar quality may not be obtained for questions asked in other languages. We did not assess intraobserver and interobserver variability among reviewers. Finally, the assessment of response quality was based on a 4-point scoring system, which, despite its structure, can still be influenced by the examiner's subjectivity.
The results of the current study showed for the first time that ChatGPT provided adequate answers to over 80% of the andrology FAQs. Although ChatGPT has limitations, it is a constantly developing platform and can be expected to take an important place in the health sector in the future. Getting support from health-care professionals in the development of AI models can increase the reliability of the model.

Figure 1. Flowchart of questions included in the study.

Figure 2. Similarity rates of answers to questions. EAU, European Association of Urology.

Table 1. Grade of Responses by the Chat Generative Pretrained Transformer to the Questions Related to Andrology

Table 2. Questions Related to Guideline Recommendations by the European Association of Urology (EAU) (Continued)

European Association of Urology (EAU) Guideline Questions
Is it considered acceptable to use grafts for penile girth enhancement?
62- Which method or factor can aid in determining the subtype of priapism?
63- What are the recommended laboratory testing for diagnosing ischemic priapism?
64- What procedure or action should be performed when planning embolization for the management of non-ischemic priapism?
65- How should the management of ischemic priapism be initiated?
66- In cases of priapism secondary to intracavernous injections of vasoactive agents, what is the initial step that should be taken?
67- Should ischaemic priapism associated with sickle cell disease be treated in the same manner as other cases of ischaemic priapism?
68- If a shunt procedure has been performed, is it recommended to delay the implantation of a penile prosthesis?
69- What are the recommended treatment for stuttering priapism?
70- What are the essential components of a male infertility evaluation?
71- In which condition should standard karyotype analysis and genetic counseling be considered for diagnostic purposes in semen analysis?
72- Is it useful to test for Y-chromosome microdeletions in men with pure obstructive azoospermia?
73- When might Y-chromosome microdeletion testing be offered to men?
74- What should be attempted in patients with complete deletions that include the AZFa and AZFb regions?
75- In men with structural abnormalities of the vas deferens (unilateral or bilateral absence without renal agenesis), what should both the man and his partner be tested for?
76- What should be performed in the assessment of couples experiencing recurrent pregnancy loss from natural conception and assisted reproductive technology (ART), as well as in men with unexplained infertility?
77- What should be performed if there is suspicion of a partial or complete distal obstruction?
78- Should imaging be considered for detecting renal abnormalities in men who have structural abnormalities of the vas deferens and no evidence of cystic fibrosis transmembrane conductance regulator abnormalities?
79- How should hypogonadotropic hypogonadism (secondary hypogonadism), including congenital causes, be treated?
80- Which factors should be examined to determine if they can interfere with testosterone production or action?